Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training
Authors
Abstract
Medical vision-and-language pre-training provides a feasible solution to extract effective representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs that make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction, to deal with the different levels of abstraction in vision and language. Third, we develop different designs for the vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, where state-of-the-art results are achieved on all downstream tasks. Besides, we conduct further analysis to better verify the effectiveness of different components of our approach and of various settings of pre-training. The source code is available at https://github.com/zhjohnchan/M3AE .
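The asymmetric masking design (a much larger masking ratio for image patches than for text tokens) can be illustrated in a few lines. The sketch below is a minimal illustration and not the authors' released code: the function name `random_mask` and the 75%/15% ratios are assumptions chosen to reflect common practice in masked image/language modeling rather than the paper's exact configuration.

```python
# Minimal sketch of asymmetric random masking for a multi-modal masked autoencoder.
# Assumed names and ratios; not the authors' implementation.
import torch

IMG_MASK_RATIO = 0.75  # assumed: images are information-sparse, so mask aggressively
TXT_MASK_RATIO = 0.15  # assumed: text is information-dense, so mask conservatively

def random_mask(tokens: torch.Tensor, ratio: float):
    """Randomly remove a `ratio` fraction of tokens along the sequence dimension.

    tokens: (batch, seq_len, dim)
    returns: (kept_tokens, keep_indices, mask) where mask is True at removed
    positions, i.e. the positions the decoder would have to reconstruct.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * (1.0 - ratio)))
    scores = torch.rand(b, n, device=tokens.device)       # random priority per token
    keep_idx = scores.argsort(dim=1)[:, :n_keep]          # lowest-score tokens are kept
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)                     # True = masked / to reconstruct
    return kept, keep_idx, mask

# Toy usage: 196 image patches vs. 32 text tokens, 768-dim embeddings.
img_patches = torch.randn(2, 196, 768)
txt_tokens = torch.randn(2, 32, 768)
img_kept, _, img_mask = random_mask(img_patches, IMG_MASK_RATIO)
txt_kept, _, txt_mask = random_mask(txt_tokens, TXT_MASK_RATIO)
print(img_kept.shape, txt_kept.shape)  # far fewer visible image patches than text tokens
```

The kept tokens from both modalities would then be fed to a shared encoder, with separate decoders (a Transformer for the image branch and an MLP for the text branch, per the abstract) predicting the masked pixels and tokens.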
Similar Resources
Pre-Training CNNs Using Convolutional Autoencoders
Despite convolutional neural networks being the state of the art in almost all computer vision tasks, their training remains a difficult task. Unsupervised representation learning using a convolutional autoencoder can be used to initialize network weights and has been shown to improve test accuracy after training. We reproduce previous results using this approach and successfully apply it to th...
Faster learning of deep stacked autoencoders on multi-core systems using synchronized layer-wise pre-training
Deep neural networks are capable of modelling highly nonlinear functions by capturing different levels of abstraction of data hierarchically. While training deep networks, first the system is initialized near a good optimum by greedy layer-wise unsupervised pre-training. However, with burgeoning data and increasing dimensions of the architecture, the time complexity of this approach becomes eno...
Convergence of gradient based pre-training in Denoising autoencoders
The success of deep architectures is at least in part attributed to the layer-by-layer unsupervised pre-training that initializes the network. Various papers have reported extensive empirical analysis focusing on the design and implementation of good pre-training procedures. However, an understanding pertaining to the consistency of parameter estimates, the convergence of learning procedures an...
Pre-training of Recurrent Neural Networks via Linear Autoencoders
We propose a pre-training technique for recurrent neural networks based on linear autoencoder networks for sequences, i.e. linear dynamical systems modelling the target sequences. We start by giving a closed form solution for the definition of the optimal weights of a linear autoencoder given a training set of sequences. This solution, however, is computationally very demanding, so we suggest a...
Language Resources for Multi-Modal Dialogue Systems
This paper reviews a resource base of software agents for hub-based architectures, which can be used generally for advanced dialogue systems research and deployment. The problem of domain-specificity of dialogue managers is discussed, and we describe an approach to it developed at CSLI, involving a domain-general dialogue manager with application specific “Activity Models”. We also describe rel...
Journal
Journal Title: Lecture Notes in Computer Science
Year: 2022
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-031-16443-9_65